[Day 3] - Mastering the Basics of Text Preprocessing: Web Scraping & String Processing - iT 邦幫忙

2024 iThome Ironman Contest

DAY 3

Self-Challenge Group

Part 3 of the series "A 30-Day Introductory Training Plan for NLP Beginners"

16th Ironman Contest

sfg

2024-08-08 19:02:49

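Now that we know what NLP is and where it can be applied, the next question is how to actually turn text into data that a computer can understand and process.

We come across text data all the time in daily life: news, textbooks, social media posts, or this very article. None of it can be fed directly into model training or used to build a retrieval system, though. It first has to go through text preprocessing.

This is the same as in machine learning in general, where a large batch of raw, unprocessed data has to go through data preprocessing before a model can be trained. This stage is relatively easy but quite tedious: you need to watch out for the file encoding, the data type of each column, missing values, outliers, normalization, and so on. Only after all of that is handled can the data be used for training.

I once heard a professor say in class: "Garbage in, garbage out." Good data does not guarantee a good model, but bad data will certainly produce a bad one, and NLP is no exception.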

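For example, the emoji "😂" is the Unicode code point U+1F602 (written "\U0001F602" as a Python escape; its UTF-8 encoding is the four bytes F0 9F 98 82). When it shows up in text as a raw escape sequence, neither humans nor computers can tell what it means, unless we replace it with "face with tears of joy" or simply delete it.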
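To make the difference between a code point, its UTF-8 bytes, and a readable name concrete, here is a minimal sketch using only the Python standard library. The variable names are made up for illustration; `unicodedata.name` returns the official Unicode character name, which is one possible source for a readable replacement string.

```python
import unicodedata

emoji = "😂"

# Unicode code point (what "U+1F602" refers to)
code_point = f"U+{ord(emoji):04X}"

# UTF-8 encoding: the bytes actually stored in a UTF-8 encoded file
utf8_bytes = emoji.encode("utf-8")

# Official Unicode character name, lowercased to use as a readable replacement
name = unicodedata.name(emoji).lower()

print(code_point)   # U+1F602
print(utf8_bytes)   # b'\xf0\x9f\x98\x82'
print(name)         # face with tears of joy
```

Whether to replace an emoji with its name or drop it depends on the task: sentiment analysis may benefit from keeping the meaning, while keyword extraction probably will not.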

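I consider text preprocessing the first basic skill to learn when getting into NLP. It is not hard, so let's implement it directly in code!

As an example, I scraped a news article about the Olympic badminton men's doubles gold medal from a few days ago: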

import requests
from bs4 import BeautifulSoup

url = 'https://www.taiwannews.com.tw/news/5914017'
# Send a browser-like User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # This class name comes from the page's markup and may change over time
    article = soup.find('div', class_='editor __className_11742b')
    if article is not None:
        # Join the text of every <p> inside the article body
        paragraphs = article.find_all('p')
        content = ' '.join(p.get_text() for p in paragraphs)
        print(content)

"""TAIPEI (Taiwan News) — Taiwan's badminton duo on Sunday (Aug. 4) captured the country's first Olympic gold in the Paris Olympics and are the first Taiwanese shuttlers to win back-to-back Olympic golds. In the men's badminton doubles finals, Lee Yang (李洋) and Wang Chi-lin (王齊麟)..."""

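This is clearly unstructured data, with problems such as inconsistent capitalization, assorted punctuation, and mixed Chinese and English text, so next we will deal with them one by one.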

Lowercasing

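The first step is to convert all text to lowercase so that words are represented consistently.

A very important later step is tokenization, which splits a full sentence into meaningful tokens. If we don't lowercase first, words with the same spelling but different capitalization are treated as different words, which affects the model's downstream judgments.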
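As a small illustration of why casing matters once text is split into tokens, the sketch below counts words in a made-up sentence. It uses naive whitespace splitting as a stand-in for a real tokenizer, so treat it as an assumption-laden toy rather than how tokenization is done in practice.

```python
from collections import Counter

# Hypothetical sentence, written only for this illustration
sentence = "Olympic gold at the Paris Olympics and the first olympic gold"

# Naive whitespace tokenization, before and after lowercasing
raw_counts = Counter(sentence.split())
lower_counts = Counter(sentence.lower().split())

# Without lowercasing, "Olympic" and "olympic" are counted separately
print(raw_counts["Olympic"], raw_counts["olympic"])  # 1 1

# After lowercasing, both occurrences fall under one token
print(lower_counts["olympic"])  # 2
```

Lowercasing the scraped article works the same way, with a single call on the whole string: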

content = content.lower()
print(content)

"""taipei (taiwan news) — taiwan's badminton duo on sunday (aug. 4) captured the country's first olympic gold in the paris olympics and are the first taiwanese shuttlers to win back-to-back olympic golds. in the men's badminton doubles finals, lee yang (李洋) and wang chi-lin (王齊麟)..."""

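Remove Punctuation

Next is removing punctuation. As before, we want to end up with only the most essential, meaningful words, so punctuation is stripped out ahead of time.

I use the punctuation string provided by the string module, !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, but you can also define your own. For example, I noticed that this article contains the — character, which is not in string.punctuation, so I added it manually.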

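import string
# string.punctuation lacks the dash character used in this article, so add it manually
punctuation_str = string.punctuation + "—"
# Iterate character by character and keep only non-punctuation characters
content = ''.join(ch for ch in content if ch not in punctuation_str)
print(content)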

"""taipei taiwan news  taiwans badminton duo on sunday aug 4 captured the countrys first olympic gold in the paris olympics and are the first taiwanese shuttlers to win backtoback olympic golds in the mens badminton doubles finals lee yang 李洋 and wang chilin 王齊麟..."""

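I did notice some minor issues, though. For example, a time such as 11:11 becomes 1111, so you either need extra handling or should skip this step, depending on whether it affects the task you plan to do next.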
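One possible form of that extra handling is sketched below: a single regular expression that matches either a clock time or a punctuation character, keeping the former and dropping the latter. The function name `remove_punctuation_keep_times` and the `\d{1,2}:\d{2}` time pattern are assumptions made for this example; they are not part of the original code and would need adjusting for other formats such as dates or decimals.

```python
import re
import string

# Same punctuation set as above, escaped so it is safe inside a character class
_PUNCT_CLASS = re.escape(string.punctuation + "—")

# Alternative 1 captures times like 9:05 or 11:11; alternative 2 matches one punctuation character
_PATTERN = re.compile(r"(\d{1,2}:\d{2})|[" + _PUNCT_CLASS + "]")


def remove_punctuation_keep_times(text: str) -> str:
    """Delete punctuation, but leave HH:MM style times untouched."""
    # If the time group matched, put it back unchanged; otherwise replace with nothing
    return _PATTERN.sub(lambda m: m.group(1) or "", text)


sample = "at 11:11, lee yang (李洋) won!"
print(remove_punctuation_keep_times(sample))  # at 11:11 lee yang 李洋 won
```

Because the time alternative is tried first at each position, the colon inside 11:11 is consumed as part of the time and never reaches the punctuation branch.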

Remove Chinese characters

Finally, we remove the Chinese characters, since we don't need them for now:

import re
content = re.sub(r'[\u4e00-\u9fff]+', '', content)
print(content)

"""taipei taiwan news  taiwans badminton duo on sunday aug 4 captured the countrys first olympic gold in the paris olympics and are the first taiwanese shuttlers to win backtoback olympic golds in the mens badminton doubles finals lee yang  and wang chilin ..."""

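So far everything falls within very simple string processing. Other steps, such as removing emoji or URLs, aren't needed for this article so I did not cover them, but none of them are hard, and you can try them yourself.

Tomorrow we'll continue on to more advanced preprocessing!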
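For readers who want a starting point, here is a rough sketch of those two steps, plus collapsing the leftover double spaces that the removals above leave behind. The helper names, the sample string, and especially the emoji ranges are assumptions for illustration: the ranges cover common emoji blocks only and are not an exhaustive definition of what counts as an emoji.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters
URL_PATTERN = re.compile(r"https?://\S+")

# Approximate emoji ranges (common pictograph blocks), not a complete list
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def remove_urls(text: str) -> str:
    return URL_PATTERN.sub("", text)


def remove_emoji(text: str) -> str:
    return EMOJI_PATTERN.sub("", text)


def normalize_whitespace(text: str) -> str:
    # Collapse runs of whitespace into one space and trim both ends
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical input, not taken from the scraped article
sample = "so funny 😂 see https://example.com/news for more"
cleaned = normalize_whitespace(remove_emoji(remove_urls(sample)))
print(cleaned)  # so funny see for more
```

URLs are removed before punctuation would be, since stripping punctuation first would break the `://` that the pattern relies on.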

Reference news article
https://www.taiwannews.com.tw/news/5914017

PS: The GitHub code isn't tidied up yet, so I'll upload it in a few days.